ML 01L - Data Exploration Lab (Python)

Data Exploration

In this notebook, we will use the dataset we cleansed in the previous lab to do some Exploratory Data Analysis (EDA).

This will help us better understand our data and, in turn, build a better model.

In this lesson you:

  • Identify log-normal distributions
  • Build a baseline model and evaluate it
%run "../Includes/Classroom-Setup"

Let's keep 80% of our data for the training set and set aside 20% for the test set. We will use the randomSplit method (see the Python/Scala API docs).

We will discuss the train-test split in more detail later, but throughout this notebook, do your data exploration on trainDF.

filePath = "dbfs:/mnt/training/airbnb/sf-listings/sf-listings-2019-03-06-clean.parquet/"
airbnbDF = spark.read.parquet(filePath)
trainDF, testDF = airbnbDF.randomSplit([.8, .2], seed=42)
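As a quick sanity check (an addition, not part of the original lab), we can confirm the split sizes come out roughly 80/20:

# Sanity check: the counts should be roughly 80% / 20% of the full dataset.
print(f"train: {trainDF.count()}, test: {testDF.count()}, total: {airbnbDF.count()}")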

Let's make a histogram of the price column to explore it (in the plot options, change the number of bins to 300).

display(trainDF.select("price"))
 
price
200
130
95
250
250
115
105
86
100
220
110
130
250
100
350
200
250
299

Showing the first 1000 rows.
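If you are not using the built-in plot options, a minimal sketch of the same histogram with matplotlib (assuming matplotlib is available on the driver; an addition, not part of the original lab):

# Pull the price column to the driver as pandas and plot a histogram with 300 bins.
import matplotlib.pyplot as plt

price_pd = trainDF.select("price").toPandas()
price_pd["price"].hist(bins=300)
plt.xlabel("price")
plt.ylabel("count")
plt.show()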

Is this a log-normal distribution? Take the log of price and check the histogram. Keep this in mind for later :).

# TODO
from pyspark.sql.functions import log
 
display(trainDF.select(log("price")))
 
ln(price)
5.298317366548036
4.867534450455582
4.553876891600541
5.521460917862246
5.521460917862246
4.74493212836325
4.653960350157523
4.454347296253507
4.605170185988092
5.393627546352362
4.700480365792417
4.867534450455582
5.521460917862246
4.605170185988092
5.857933154483459
5.298317366548036
5.521460917862246
5.700443573390686

Showing the first 1000 rows.
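One way to back up the visual check (an addition, not part of the original lab) is to compare the skewness of price and log(price); a strong positive skew that shrinks toward zero after the log transform is consistent with a roughly log-normal shape:

# Compare skewness before and after the log transform.
from pyspark.sql.functions import log, skewness

display(trainDF.select(
  skewness("price").alias("price_skew"),
  skewness(log("price")).alias("log_price_skew")))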

Now take a look at how price depends on some of the variables:

  • Plot price vs bedrooms
  • Plot price vs accommodates

Make sure to change the aggregation to AVG in the plot options (a code sketch of the same aggregation follows the table below).

display(trainDF)
 
(Output of display(trainDF): the first rows of the training set, with columns host_is_superhost, cancellation_policy, instant_bookable, host_total_listings_count, neighbourhood_cleansed, latitude, longitude, property_type, room_type, accommodates, bathrooms, bedrooms, beds, bed_type, minimum_nights, number_of_reviews, review_scores_rating, review_scores_accuracy, review_scores_cleanliness, review_scores_checkin, review_scores_communication, review_scores_location, review_scores_value, price, and the *_na indicator columns.)

Showing the first 1000 rows.
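If you prefer to compute the AVG aggregation in code rather than in the plot options, a minimal sketch (one grouping per plot):

# Average price per number of bedrooms, and per accommodates value,
# mirroring the AVG aggregation selected in the plot options above.
from pyspark.sql.functions import avg

display(trainDF.groupBy("bedrooms").agg(avg("price").alias("avg_price")).orderBy("bedrooms"))
display(trainDF.groupBy("accommodates").agg(avg("price").alias("avg_price")).orderBy("accommodates"))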

Let's take a look at the distribution of some of our categorical features.

display(trainDF.groupBy("room_type").count())
 
room_type          count
Shared room          154
Entire home/apt     3512
Private room        2114

Showing all 3 rows.

Which neighbourhoods have the highest number of rentals? Display the neighbourhoods and their associated count in descending order.

# TODO
from pyspark.sql.functions import col
 
display(trainDF
  .groupBy("neighbourhood_cleansed").count()
  .orderBy(col("count").desc()))
 
neighbourhood_cleansed    count
Mission                     554
Western Addition            492
South of Market             486
Downtown/Civic Center       439
Castro/Upper Market         318
Bernal Heights              301
Haight Ashbury              286
Noe Valley                  259
Outer Sunset                219
Nob Hill                    174
Potrero Hill                171
Inner Richmond              152
Marina                      148
Excelsior                   134
Russian Hill                130
Bayview                     125
Inner Sunset                124
Outer Mission               124

Showing all 36 rows.

How much does the price depend on the location?
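Before rendering the heatmap below, a quick tabular view of average price per neighbourhood answers the same question (a sketch added for illustration, not part of the original lab):

# Average nightly price per neighbourhood, most expensive first.
from pyspark.sql.functions import avg, col

display(trainDF
  .groupBy("neighbourhood_cleansed")
  .agg(avg("price").alias("avg_price"))
  .orderBy(col("avg_price").desc()))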

trainDF.createOrReplaceTempView("trainDF")

We can use displayHTML to render any HTML, CSS, or JavaScript code.

%python
 
from pyspark.sql.functions import col
 
trainDF = spark.table("trainDF")
 
# Scale price down (here by 600) so it can be used as the heatmap intensity.
lat_long_price_values = trainDF.select(col("latitude"), col("longitude"), col("price")/600).collect()
 
lat_long_price_strings = [
  "[{}, {}, {}]".format(lat, long, price) 
  for lat, long, price in lat_long_price_values
]
 
v = ",\n".join(lat_long_price_strings)
 
# DO NOT worry about what this HTML code is doing! We took it from Stack Overflow :-)
displayHTML("""
<html>
<head>
 <link rel="stylesheet" href="https://unpkg.com/leaflet@1.3.1/dist/leaflet.css"
   integrity="sha512-Rksm5RenBEKSKFjgI3a41vrjkw4EVPlJ3+OiI65vTjIdo9brlAacEuKOiQ5OFh7cOI1bkDwLqdLw3Zg0cRJAAQ=="
   crossorigin=""/>
 <script src="https://unpkg.com/leaflet@1.3.1/dist/leaflet.js"
   integrity="sha512-/Nsx9X4HebavoBvEBuyp3I7od5tA0UzAxs+j83KgC8PU0kgB4XiK4Lfe4y4cgBtaRJQEIFCW+oC506aPT2L1zw=="
   crossorigin=""></script>
 <script src="https://cdnjs.cloudflare.com/ajax/libs/leaflet.heat/0.2.0/leaflet-heat.js"></script>
</head>
<body>
    <div id="mapid" style="width:700px; height:500px"></div>
  <script>
  var mymap = L.map('mapid').setView([37.7587,-122.4486], 12);
  var tiles = L.tileLayer('http://{s}.tile.osm.org/{z}/{x}/{y}.png', {
    attribution: '&copy; <a href="http://osm.org/copyright">OpenStreetMap</a> contributors',
}).addTo(mymap);
  var heat = L.heatLayer([""" + v + """], {radius: 25}).addTo(mymap);
  </script>
  </body>
  </html>
""")

Baseline Model

Before we build any machine learning models, we want a baseline model to compare against. We also want to pick a metric to evaluate our model; let's use RMSE here.

For this dataset, let's build one baseline model that always predicts the average price and one that always predicts the median price, and see how we do. Do this in two separate steps (a possible code sketch follows the list):

  1. trainDF: Extract the average and median price from trainDF, and store them in the variables avgPrice and medianPrice, respectively.
  2. testDF: Create two additional columns called avgPrediction and medianPrediction with the average and median price from trainDF, respectively. Call the resulting DataFrame predDF.
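One possible sketch of these two steps (your solution may differ; the median here is computed with Spark SQL's percentile_approx, which is an approximation):

# TODO (one possible solution sketch)
from pyspark.sql.functions import avg, lit

# Step 1: average and median price from the training set.
avgPrice = trainDF.select(avg("price")).first()[0]
medianPrice = trainDF.selectExpr("percentile_approx(price, 0.5)").first()[0]

# Step 2: constant-prediction columns on the test set.
predDF = (testDF
  .withColumn("avgPrediction", lit(avgPrice))
  .withColumn("medianPrediction", lit(medianPrice)))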

Some useful functions: